Conversation

@max-Hawkins
Collaborator

Description

Previously, when trying to run with 'rdma_mpi' set to true, I would get illegal buffer pointer values in the halo communication routines. After a small change that makes sense to me, CUDA-aware MPI now appears to work on both Delta and Phoenix.
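For readers unfamiliar with the rdma_mpi path, here is a minimal sketch of a device-direct halo exchange with OpenACC and CUDA-aware MPI. The names (halo_exchange_rdma, q_send_buf, q_recv_buf, nbr_left, nbr_right) are hypothetical and only illustrate the pattern, not MFC's actual routines:

```fortran
! Hypothetical sketch of a device-direct (CUDA-aware) halo exchange; not MFC's code.
! Assumes q_send_buf/q_recv_buf are already device-resident (present) from an
! enclosing OpenACC data region.
subroutine halo_exchange_rdma(q_send_buf, q_recv_buf, buf_size, nbr_left, nbr_right)
    use mpi
    implicit none
    integer, intent(in) :: buf_size, nbr_left, nbr_right
    real(kind=8), intent(inout) :: q_send_buf(buf_size), q_recv_buf(buf_size)
    integer :: ierr

    ! host_data exposes the *device* addresses of the buffers to the MPI call,
    ! so a CUDA-aware MPI library can move the halos GPU-to-GPU with no staging copies.
    !$acc host_data use_device(q_send_buf, q_recv_buf)
    call MPI_Sendrecv(q_send_buf, buf_size, MPI_DOUBLE_PRECISION, nbr_right, 0, &
                      q_recv_buf, buf_size, MPI_DOUBLE_PRECISION, nbr_left, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    !$acc end host_data
end subroutine halo_exchange_rdma
```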

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

On both Phoenix and Delta, I built and ran two cases that differed only in the 'rdma_mpi' setting: 3D_TGV on Delta and the 2D Riemann test on Phoenix. Running h5diff on the HDF5 files from the last time step showed no difference on Phoenix. Additionally, I collected Nsight Systems (nsys) reports on Phoenix to show the absence of H2D/D2H copies.

First without RDMA_MPI, the code moves the GPU data to the CPU before MPI exchange:
[Screenshot: Nsight Systems timeline showing H2D/D2H copies around the MPI exchange]
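For contrast, here is a sketch of the staged fallback path that produces the copies visible above, using the same hypothetical names as before; the explicit update directives are what appear as the D2H/H2D transfers in the timeline:

```fortran
! Hypothetical sketch of the staged (non-RDMA) halo exchange; not MFC's code.
! Assumes the buffers are device-resident from an enclosing OpenACC data region.
subroutine halo_exchange_staged(q_send_buf, q_recv_buf, buf_size, nbr_left, nbr_right)
    use mpi
    implicit none
    integer, intent(in) :: buf_size, nbr_left, nbr_right
    real(kind=8), intent(inout) :: q_send_buf(buf_size), q_recv_buf(buf_size)
    integer :: ierr

    ! D2H copy: stage the packed halo data on the host before handing it to MPI.
    !$acc update host(q_send_buf)
    call MPI_Sendrecv(q_send_buf, buf_size, MPI_DOUBLE_PRECISION, nbr_right, 0, &
                      q_recv_buf, buf_size, MPI_DOUBLE_PRECISION, nbr_left, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    ! H2D copy: push the received halo data back to the device.
    !$acc update device(q_recv_buf)
end subroutine halo_exchange_staged
```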

With RDMA_MPI enabled, there are no data copies. This led to a marginal improvement in the RHS-MPI time for this case:
[Screenshot: Nsight Systems timeline with RDMA_MPI enabled, showing no H2D/D2H copies]

If your code changes any source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new features performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that it can be identified in profiles
  • Ran an Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran an Omniperf profile using ./mfc.sh run XXXX --gpu -t simulation --omniperf, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

@abbotts
Collaborator

abbotts commented Nov 7, 2024

Unfortunately, the bounce pointers are there because CCE 18.0.0 has a bug that crashes the compiler when module arrays are used directly in the host_data directive. The bounce pointers should be valid too, so if they're failing on a different compiler then we probably have dueling compiler bugs that we'll have to handle in a more complex way.
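To make the workaround concrete, here is one plausible shape of a bounce pointer, again with hypothetical names (m_halo_buffers, q_mod_buf, p_buf, send_halo are illustrative, not MFC's identifiers): rather than naming the module array directly inside host_data, which reportedly crashes CCE 18.0.0, a local pointer alias is placed there instead:

```fortran
! Hypothetical illustration of the bounce-pointer workaround; not MFC's code.
module m_halo_buffers
    implicit none
    ! Module array holding packed halo data; allocation and packing omitted.
    real(kind=8), allocatable, target :: q_mod_buf(:)
    !$acc declare create(q_mod_buf)
end module m_halo_buffers

subroutine send_halo(count, dest, comm)
    use mpi
    use m_halo_buffers
    implicit none
    integer, intent(in) :: count, dest, comm
    real(kind=8), pointer :: p_buf(:)   ! "bounce" pointer aliasing the module array
    integer :: ierr

    ! Naming q_mod_buf directly in host_data is what reportedly trips CCE 18.0.0,
    ! so the pointer alias is used there instead.
    p_buf => q_mod_buf
    !$acc host_data use_device(p_buf)
    call MPI_Send(p_buf, count, MPI_DOUBLE_PRECISION, dest, 0, comm, ierr)
    !$acc end host_data
end subroutine send_halo
```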

@max-Hawkins
Collaborator Author

> Unfortunately, the bounce pointers are there because CCE 18.0.0 has a bug that crashes the compiler when module arrays are used directly in the host_data directive. The bounce pointers should be valid too, so if they're failing on a different compiler then we probably have dueling compiler bugs that we'll have to handle in a more complex way.

Ah dang. That's unfortunate. I hoped there wasn't some Cray compiler reason for the way things were. Henry isn't in person at the hackathon or I'm sure he would've told me this. I'll just keep this change for my personal work then. Thanks!

@max-Hawkins closed this Nov 7, 2024
@codecov

codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.04%. Comparing base (6bad379) to head (4eccce8).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #702      +/-   ##
==========================================
+ Coverage   42.99%   43.04%   +0.05%     
==========================================
  Files          61       61              
  Lines       16053    16062       +9     
  Branches     1796     1798       +2     
==========================================
+ Hits         6902     6914      +12     
+ Misses       8167     8158       -9     
- Partials      984      990       +6     

☔ View full report in Codecov by Sentry.
